import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
from pandas.plotting import scatter_matrix
%matplotlib inline
housing = pd.read_excel("helsinki_house_price_cleaned.xls", index_col=0)
housing.head()
| Link | Address | Price | Size | Year | House_type | Description | Total_rooms | Latitude | Longitude | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://www.etuovi.com/kohde/21189890?haku=M16... | Rosendalinrinki 3, Tammisto, Vantaa | 227000 | 49.5 | 2009 | Kerrostalo | 2h + kt + s | 2.0 | 60.268798 | 24.968812 |
| 1 | https://www.etuovi.com/kohde/7971959?haku=M160... | Kanttiinitie 6 B4, Painiitty, Espoo | 385900 | 92.0 | 2022 | Rivitalo | 4h + kt + s | 4.0 | 60.237447 | 24.834287 |
| 2 | https://www.etuovi.com/kohde/21772126?haku=M16... | Konalantie 4, Pitäjänmäki, Helsinki | 205000 | 66.0 | 1963 | Kerrostalo | 3h + k | 3.0 | 60.225724 | 24.853692 |
| 3 | https://www.etuovi.com/kohde/1287367?haku=M160... | Ripusuontie 64 A 2, Länsi-Pakila, Helsinki | 545000 | 92.0 | 2020 | Rivitalo | 3-4h, k, kph / 2wc, khh, sauna, vh 92 m2 + lä... | 3.0 | 60.247235 | 24.945488 |
| 4 | https://www.etuovi.com/kohde/20553097?haku=M16... | Kievarinraitti 2, Viherlaakso, Espoo | 149000 | 56.0 | 1980 | Kerrostalo | 2h + k + kph + wc (yht.järj.muk. 2h + k) | 2.0 | 60.229167 | 24.742285 |
housing.hist(bins=50, figsize=(10, 11));
As we can see, most distributions except for Latiude and Longitude are skewed:
Based on the histogram for the price, I am curious about the presence of some very rara extreme high prices.
housing[housing.Price>3e6]
| Link | Address | Price | Size | Year | House_type | Description | Total_rooms | Latitude | Longitude | |
|---|---|---|---|---|---|---|---|---|---|---|
| 76 | https://www.etuovi.com/kohde/7964542?haku=M160... | Krogiuksentie 10, Kuusisaari, Helsinki | 3100000 | 282.8 | 2011 | Omakotitalo | 7h, k, rh, 2kph, 2 erill wc, saunaos, kuntosa... | 7.0 | 60.184745 | 24.863660 |
| 435 | https://www.etuovi.com/kohde/7970077?haku=M160... | Krogiuksentie 10 F, Kuusisaari, Helsinki | 5000000 | 347.0 | 2011 | Omakotitalo | 8 h, k, rt + useita kylpyhuoneita ja wc-tiloj... | 8.0 | 60.184379 | 24.864015 |
| 510 | https://www.etuovi.com/kohde/20362550?haku=M16... | Yrjönkatu 11, Keskusta, Helsinki | 4600000 | 324.0 | 1898 | Kerrostalo | 5h + k + s | 5.0 | NaN | NaN |
| 633 | https://www.etuovi.com/kohde/7968394?haku=M160... | Kuusiniementie 9, Kuusisaari, Helsinki | 9300000 | 670.0 | 2021 | Omakotitalo | 7 h, k, 4 kph, vh, khh, 2 saunaa, uima allas,... | 7.0 | 60.187962 | 24.866627 |
| 1157 | https://www.etuovi.com/kohde/21104110?haku=M16... | Lars Sonckin tie 14, Kulosaari, Helsinki | 3200000 | 443.9 | 2012 | Erillistalo | 12 h, k, 2 saunaa, 2 oh, 3 vh, khh, 4 autopai... | 12.0 | 60.181584 | 25.015084 |
| 2220 | https://www.etuovi.com/kohde/7969579?haku=M160... | Westendintie 12 C, Westend, Espoo | 3240000 | 330.0 | 2003 | Omakotitalo | 7 h, k, aulatila, khh, 3xkph, s, autotalli, l... | 7.0 | 60.164595 | 24.789785 |
| 2408 | https://www.etuovi.com/kohde/20917802?haku=M16... | Armas Lindgrenin tie 12, Kulosaari, Helsinki | 3800000 | 211.5 | 1938 | Omakotitalo | NaN | NaN | 60.180394 | 25.014137 |
| 3967 | https://www.etuovi.com/kohde/n86779?haku=M1606... | Laajasalontie 67, Hevossalmi, Helsinki | 3200000 | 276.0 | 2004 | Paritalo | 5-6 h + keittiö + 2 kph + 2 WC + kodinhoitohu... | 5.0 | 60.160599 | 25.048773 |
| 4246 | https://www.etuovi.com/kohde/2228665?haku=M160... | Kuusisaari, Kuusisaari, Helsinki | 6800000 | 900.0 | 1977 | Omakotitalo | Ultimate luxury - Kuusisaaressa | NaN | 60.183088 | 24.858285 |
| 4253 | https://www.etuovi.com/kohde/20027837?haku=M16... | Piritanaukio 3, Eira, Helsinki | 4350000 | 233.0 | 2008 | Kerrostalo | 3-4 mh, oh, k, 2 kph, s, 2 wc, khh, 2 vh, 3 p | 3.0 | NaN | NaN |
| 4368 | https://www.etuovi.com/kohde/7947232?haku=M160... | Puistokatu 11 A, Ullanlinna, Helsinki | 3890000 | 340.0 | 1921 | Kerrostalo | 7 h + k + 2 palvelijan huonetta | 7.0 | 60.157710 | 24.945162 |
| 4389 | https://www.etuovi.com/kohde/20041445?haku=M16... | Bulevardi 14 A 60, Kamppi, Helsinki | 4960000 | 263.0 | 2021 | Kerrostalo | 4H + K | 4.0 | 60.166378 | 24.942920 |
| 4541 | https://www.etuovi.com/kohde/681330?haku=M1606... | Katajaharjuntie 18 A, Lauttasaari, Helsinki | 4500000 | 460.0 | 1996 | Erillistalo | 9h + k + 4kh + 3wc + khh + s + tkh + 3vh + at... | 9.0 | 60.162827 | 24.859260 |
| 5195 | https://www.etuovi.com/kohde/9919816?haku=M160... | Kulosaarentie 18, Kulosaari, Helsinki | 3280000 | 451.0 | 1917 | Omakotitalo | Päärakennus : 12h, keittö, 3 kylpyhuonetta.Po... | NaN | 60.184621 | 25.005714 |
housing[housing.Price==9300000].Link
633 https://www.etuovi.com/kohde/7968394?haku=M160... Name: Link, dtype: object
It is interesting to look at some basic statistics to get an idea of the range of the values for the different attributes.
housing.describe()
| Price | Size | Year | Total_rooms | Latitude | Longitude | |
|---|---|---|---|---|---|---|
| count | 5.228000e+03 | 5228.000000 | 5228.000000 | 5072.000000 | 4584.000000 | 4584.000000 |
| mean | 3.737330e+05 | 72.430260 | 2004.037873 | 2.774448 | 60.221863 | 24.885070 |
| std | 3.529972e+05 | 45.187783 | 27.948402 | 1.218082 | 0.054915 | 0.141540 |
| min | 0.000000e+00 | 12.000000 | 1830.000000 | 1.000000 | 60.124636 | 24.505078 |
| 25% | 2.225650e+05 | 46.000000 | 1985.000000 | 2.000000 | 60.173362 | 24.783111 |
| 50% | 2.987800e+05 | 65.000000 | 2021.000000 | 3.000000 | 60.210752 | 24.880124 |
| 75% | 4.098000e+05 | 85.000000 | 2022.000000 | 3.000000 | 60.261167 | 24.997378 |
| max | 9.300000e+06 | 900.000000 | 2023.000000 | 12.000000 | 60.383783 | 25.226951 |
As is to be expected, the standard deviation is largest for the price as it has the largest range of values. The average appartment is 72 $m^2$, has 3 rooms and costs about 373000 €. Let's look at the median values.
housing.median()
Price 298780.000000 Size 65.000000 Year 2021.000000 Total_rooms 3.000000 Latitude 60.210752 Longitude 24.880124 dtype: float64
The median cost is lower than the mean one as the mean is more susceptible to outliers as the ones in the higher range.
housing[housing.Price==0].Link # there seems to be an error in one of the ads as the price is set to 0 €
588 https://www.etuovi.com/kohde/q46363?haku=M1606... Name: Link, dtype: object
Let's make the most out of the location to see how homogeneous is the spread of our data.
housing.plot(kind="scatter", x="Longitude", y="Latitude", alpha=0.25, figsize=(10,7));
*c* argument looks like a single numeric RGB or RGBA sequence, which should be avoided as value-mapping will have precedence in case its length matches with *x* & *y*. Please use the *color* keyword-argument or provide a 2-D array with a single row if you intend to specify the same RGB or RGBA value for all points.
Let's add some more detail to the graph by representing the price of the houses in the color palette.
housing.plot(kind="scatter", x="Longitude", y="Latitude", alpha=0.25,
figsize=(12,7), c="Price", cmap=plt.get_cmap("jet"), colorbar=True)
plt.legend()
No handles with labels found to put in legend.
<matplotlib.legend.Legend at 0x7fa7e0329828>
Since the larger values are very scarce and extremely large, the differences in colour between most of the housing is not easy to spot. Let's look at the lower half of the prices to see better how the price is distributed
housing[housing.Price<2e6].plot(kind="scatter", x="Longitude", y="Latitude", alpha=0.25,
figsize=(12,7), c="Price", cmap=plt.get_cmap("jet"), colorbar=True)
plt.legend()
No handles with labels found to put in legend.
<matplotlib.legend.Legend at 0x7fa7dfb0b438>
By looking only at prices worth less than 2M€ we are able to see more clustering of higher prices in certain areas (especially closer to the shore).
housing.Latitude.mean(), housing.Longitude.mean(),
(60.22186284237546, 24.885069945259353)
from bokeh.io import output_file, output_notebook, show
from bokeh.models import (
GMapPlot, GMapOptions, ColumnDataSource, Circle, LogColorMapper, BasicTicker, ColorBar,
Range1d, PanTool, WheelZoomTool, BoxSelectTool
)
from bokeh.models.mappers import ColorMapper, LinearColorMapper
from bokeh.palettes import Viridis5
map_options = GMapOptions(lat=60.22186, lng=24.88507, map_type="roadmap", zoom=10)
plot = GMapPlot(
x_range=Range1d(), y_range=Range1d(), map_options=map_options
)
plot.title.text = "Hey look! It's a scatter plot on a map!"
# For GMaps to function, Google requires you obtain and enable an API key:
#
# https://developers.google.com/maps/documentation/javascript/get-api-key
#
plot.api_key = "AIzaSyDbk_fhOkmOrE6Fg9AQlLF7H95e0482Ii8"
source = ColumnDataSource(
data=dict(
lat=housing[housing.Price<2e6][housing.Latitude.notnull()].Latitude.tolist(),
lon=housing[housing.Price<2e6][housing.Latitude.notnull()].Longitude.tolist(),
color=housing[housing.Price<2e6][housing.Latitude.notnull()].Price.tolist()
)
)
color_mapper = LinearColorMapper(palette=Viridis5)
circle = Circle(x="lon", y="lat", fill_color={'field': 'color', 'transform': color_mapper}, fill_alpha=0.5, line_color=None)
plot.add_glyph(source, circle)
color_bar = ColorBar(color_mapper=color_mapper, ticker=BasicTicker(),
label_standoff=12, border_line_color=None, location=(0,0))
plot.add_layout(color_bar, 'right')
plot.add_tools(PanTool(), WheelZoomTool(), BoxSelectTool())
output_file("gmap_plot.html")
output_notebook()
show(plot)
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:26: UserWarning: Boolean Series key will be reindexed to match DataFrame index. /usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:27: UserWarning: Boolean Series key will be reindexed to match DataFrame index. /usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:29: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
Projecting the scatter plot over google maps let's us see how the more expensive housing (capped at 2M€) is located at the centre of Helsinki and in some properties with shore access that are not in central parts of the city. Since Google Maps has limited the free option for this representation, we will use GeoPandas instead.
import geopandas as gpd
housing_geo = gpd.GeoDataFrame(
housing, geometry=gpd.points_from_xy(housing.Longitude, housing.Latitude))
housing_geo.head()
| Link | Address | Price | Size | Year | House_type | Description | Total_rooms | Latitude | Longitude | geometry | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://www.etuovi.com/kohde/21189890?haku=M16... | Rosendalinrinki 3, Tammisto, Vantaa | 227000 | 49.5 | 2009 | Kerrostalo | 2h + kt + s | 2.0 | 60.268798 | 24.968812 | POINT (24.96881 60.26880) |
| 1 | https://www.etuovi.com/kohde/7971959?haku=M160... | Kanttiinitie 6 B4, Painiitty, Espoo | 385900 | 92.0 | 2022 | Rivitalo | 4h + kt + s | 4.0 | 60.237447 | 24.834287 | POINT (24.83429 60.23745) |
| 2 | https://www.etuovi.com/kohde/21772126?haku=M16... | Konalantie 4, Pitäjänmäki, Helsinki | 205000 | 66.0 | 1963 | Kerrostalo | 3h + k | 3.0 | 60.225724 | 24.853692 | POINT (24.85369 60.22572) |
| 3 | https://www.etuovi.com/kohde/1287367?haku=M160... | Ripusuontie 64 A 2, Länsi-Pakila, Helsinki | 545000 | 92.0 | 2020 | Rivitalo | 3-4h, k, kph / 2wc, khh, sauna, vh 92 m2 + lä... | 3.0 | 60.247235 | 24.945488 | POINT (24.94549 60.24723) |
| 4 | https://www.etuovi.com/kohde/20553097?haku=M16... | Kievarinraitti 2, Viherlaakso, Espoo | 149000 | 56.0 | 1980 | Kerrostalo | 2h + k + kph + wc (yht.järj.muk. 2h + k) | 2.0 | 60.229167 | 24.742285 | POINT (24.74229 60.22917) |
import contextily as ctx
--------------------------------------------------------------------------- ModuleNotFoundError Traceback (most recent call last) <ipython-input-2-90425f271be4> in <module> ----> 1 import contextily as ctx ModuleNotFoundError: No module named 'contextily'
housing_geo.crs = {'init': 'epsg:4326', 'no_defs': True}
housing_geo = housing_geo.to_crs(epsg=3857)
/usr/local/lib/python3.6/dist-packages/pyproj/crs/crs.py:53: FutureWarning: '+init=<authority>:<code>' syntax is deprecated. '<authority>:<code>' is the preferred initialization method. When making the change, be mindful of axis order changes: https://pyproj4.github.io/pyproj/stable/gotchas.html#axis-order-changes-in-proj-6
return _prepare_from_string(" ".join(pjargs))
ax = housing_geo.plot(figsize=(20, 20), column=housing_geo.Price, cmap=plt.get_cmap("jet"), alpha=0.3)
ctx.add_basemap(ax)
ax = housing_geo[housing_geo.Price<=2e6].plot(figsize=(15, 15),
column=housing_geo[housing_geo.Price<=2e6].Price,
cmap=plt.get_cmap("jet"),
alpha=0.3,
legend=True,
legend_kwds={'label': "Price (€)",
'orientation': "horizontal"})
ctx.add_basemap(ax)
plt.title("Price for <2M€ housing in Helsinki, Espoo and Vantaa")
Text(0.5, 1.0, 'Price for <2M€ housing in Helsinki, Espoo and Vantaa')
In the latest map, we can appreciate high density clusters of similar price which likely correspond to new buildings that have been recently built and have just entered the market. It is also easy to see the high value properties located in the center of Helsinki and, in certain cases, spread throughout the shore.
ax = housing_geo[housing_geo.Price>=2e6].plot(figsize=(15, 15),
column=housing_geo[housing_geo.Price>=2e6].Price,
cmap=plt.get_cmap("jet"),
alpha=0.3,
legend=True,
legend_kwds={'label': "Price (€)",
'orientation': "horizontal"})
ctx.add_basemap(ax)
plt.title("Price for >2M€ housing in Helsinki, Espoo and Vantaa")
Text(0.5, 1.0, 'Price for >2M€ housing in Helsinki, Espoo and Vantaa')
Looking at the map, we can see how the most expensive houses are in very different areas with direct access to water and larger living area.
housing[housing.Price>=2e6].describe()
| Price | Size | Year | Total_rooms | Latitude | Longitude | |
|---|---|---|---|---|---|---|
| count | 3.600000e+01 | 36.000000 | 36.000000 | 29.000000 | 29.000000 | 29.000000 |
| mean | 3.352583e+06 | 301.325000 | 1972.166667 | 5.758621 | 60.175067 | 24.939940 |
| std | 1.427768e+06 | 150.705849 | 50.552095 | 2.115507 | 0.013234 | 0.096278 |
| min | 2.090000e+06 | 122.500000 | 1898.000000 | 3.000000 | 60.155061 | 24.789785 |
| 25% | 2.467500e+06 | 218.625000 | 1914.750000 | 4.000000 | 60.164595 | 24.863660 |
| 50% | 2.985000e+06 | 260.250000 | 2000.000000 | 6.000000 | 60.176570 | 24.945162 |
| 75% | 3.410000e+06 | 341.750000 | 2012.500000 | 7.000000 | 60.184621 | 25.014137 |
| max | 9.300000e+06 | 900.000000 | 2022.000000 | 12.000000 | 60.201410 | 25.087778 |
The highest valued properties are located in very different areas. Something the all share is that they have a large size ($301\pm150\, m^2$), more rooms that the average household ($5.7\pm2.1$) and have a diverse set of years of construction ($1972\pm50$).
corr_mat = housing.corr()
corr_mat
| Price | Size | Year | Total_rooms | Latitude | Longitude | |
|---|---|---|---|---|---|---|
| Price | 1.000000 | 0.721781 | -0.102507 | 0.447501 | -0.281566 | -0.025807 |
| Size | 0.721781 | 1.000000 | -0.209697 | 0.815967 | -0.082897 | -0.022744 |
| Year | -0.102507 | -0.209697 | 1.000000 | -0.081013 | 0.036972 | -0.196787 |
| Total_rooms | 0.447501 | 0.815967 | -0.081013 | 1.000000 | -0.023466 | -0.031596 |
| Latitude | -0.281566 | -0.082897 | 0.036972 | -0.023466 | 1.000000 | 0.445697 |
| Longitude | -0.025807 | -0.022744 | -0.196787 | -0.031596 | 0.445697 | 1.000000 |
Let's look at the linear correlation of each attribute with the dependent variable: Price.
corr_mat.Price.sort_values(ascending=False)
Price 1.000000 Size 0.721781 Total_rooms 0.447501 Longitude -0.025807 Year -0.102507 Latitude -0.281566 Name: Price, dtype: float64
We can see how there is a strong correlation between teh size of the house and the price. This is to be expected as bigger huosing is more expensive, obviously. The number of rooms is also positively correlated.
The year that the building was constructed only seems to have a slight negative correlation. This correlation is likely not strong as old buildings tend to have better locations (e.g. center) than newer buildings.
The latitude (higher latitude, more north in this hemisphere) seems to be negatively correlated as the more expensive buildings are in the south, near the coast. The longitude does not seem to carry a large weight in the price.
scatter_matrix(housing, figsize=(12, 8));